The ARC-AGI-3 benchmark challenges AI systems to match untrained human performance in interactive environments, with no frontier model achieving more than 1% success. The test strips away AI's typical advantages, exposing a gap in reasoning and adaptability.
ServiceNow Research introduces EnterpriseOps-Gym, a high-fidelity benchmark for evaluating agentic planning in realistic enterprise environments. The benchmark targets challenges such as long-horizon planning and access controls.
OpenAI plans to retire the SWE-bench Verified benchmark, citing flaws that undermine its validity as a coding performance measure. The move highlights concerns about memorization in AI model evaluations.